OCPBUGS-8446: MCO-503: daemon: have a special path to sync in certs #3575

Merged: 4 commits merged into openshift:master on Mar 15, 2023

Contributor

@yuqi-zhang yuqi-zhang commented Feb 28, 2023

Have a new routine that watches for controllerconfig changes to read the latest kubelet ca bundle, and lay that down to disk directly. File writing and validation will skip over it.

This is still WIP but the functionality should work. You can try this by pausing the pools and forcing a rotation of any of the certificates in the bundle. Although it won't be instant, the MCD will sync it once the safety controllerconfig sync happens.

I tried doing this via an additional configmap but it feels like this is a bit cleaner, and allows the MCD to have limited permissions in what it can read.

  • What I did
  1. Add a new controllerconfig watcher + sync to write the newest kubelet ca bundle directly to disk
  2. Remove MCD file writing and verification for that file
  3. Add a test to write the file
  4. Remove (now incorrect) cert metric and alert
  5. Add additional path in the MCS to serve the latest bundle as well
  • How to verify it
  1. Install a 4.13 cluster without this change and pause pools; in ~24 hours you will get kubeletdown critical alerts. With this change, you should no longer see them
  2. Faster way to verify: pause pools, manually expire a cert (e.g. kube-apiserver-to-kubelet-signer), and see the change propagate to disk without unpausing

Note that this is only a temporary 4.13 solution, with future steps outlined in https://issues.redhat.com/browse/MCO-499 for a more comprehensive solution

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Feb 28, 2023
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 28, 2023

@yuqi-zhang: This pull request references MCO-503 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Feb 28, 2023
@openshift-ci
Contributor

openshift-ci bot commented Feb 28, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label Feb 28, 2023
@yuqi-zhang
Contributor Author

/test all

@yuqi-zhang yuqi-zhang marked this pull request as ready for review March 1, 2023 00:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Mar 1, 2023
@yuqi-zhang
Contributor Author

Will update the commit messages, but the functionality should be ready for review. Seems to (mostly) be working locally, let's see how it fares against CI

pkg/daemon/certificate_writer.go

kubeAPIServerServingCABytes := controllerConfig.Spec.KubeAPIServerServingCAData

if err := writeFileAtomicallyWithDefaults(caBundleFilePath, kubeAPIServerServingCABytes); err != nil {
Member

This means we're rewriting the file even if it hasn't changed, but something else in the controllerconfig did? That may be OK for now. But I am sure if this approach was to be extended at all beyond this we'd want to enhance things.

I've been thinking about things like this in the context of containers/bootc#22 and if we represented this certificate as a bootc configmap, the design I have in mind for the bootc configmap support would use ostree as a backend, and inherently hash its inputs.

Contributor Author

I can certainly make it more nuanced. This is more of a hacky way to get us past the issues highlighted in https://issues.redhat.com/browse/MCO-499, with a better way to do it in 4.14

@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 1, 2023

@yuqi-zhang: This pull request references MCO-503 which is a valid jira issue.

@yuqi-zhang
Contributor Author

added a comment and annotation for easier troubleshooting if something were to go wrong

@yuqi-zhang
Contributor Author

gcp-op failed since the actual on-disk state was not yet synced, so added a wait there.

I will clean up the commit messages and squash once we decide if this is acceptable as a workaround for 4.13

dn.ccQueue.AddRateLimited(key)
}

func (dn *Daemon) syncControllerConfigHandler(key string) error {
Member

@cheesesashimi cheesesashimi Mar 2, 2023

question: Will this run outside of the usual dn.update() process? Also, could it potentially write different content to caBundleFilePath than the MachineConfig contains?

I ask these questions because of Config Drift Monitor. If the file is part of the MachineConfig, Config Drift Monitor will expect its contents to match. It sounds like at a minimum, Config Drift Monitor should ignore caBundleFilePath since we're managing it here.

EDIT: Since you added a guard clause to checkV3Files() to ignore it, that will handle the Config Drift Monitor case since Config Drift Monitor basically runs checkV3Files() in response to any filesystem events that target files specified in the MachineConfig.

Contributor Author

Right, I explicitly added 2 clauses in validation and writes so this would happen completely separately. This does of course cause 2 issues implicitly:

  1. the config drift monitor keeps triggering whenever the file changes, sees nothing changed, might be confused, but then lets it carry on
  2. the node still says it updates (rebootless), performs the update, but doesn't touch the file (if it's the only change), so it technically does a fake update

I can try to make these processes better but they don't block anything I believe

Member

I think it's fine as-is. You are correct about it not blocking anything: Config Drift Monitor runs in a separate Goroutine from the rest of the MCD and from that Goroutine, it calls validateOnDiskState() whenever it detects a filesystem event for the files it keeps track of.

Additionally, running validateOnDiskState() is inexpensive enough that even if it runs superfluously in response to writing to caBundleFilePath, it won't have any noticeable impact.

Have a new routine that watches for controllerconfig changes to read the
latest kubelet ca bundle, and lay that down to disk directly. File
writing and validation will skip over it.

When successful, the daemon will update node annotations to indicate the
latest resourceVersion of the controllerconfig it read from, and emit
a log. Note that due to caching, the log will appear a few times, but
the additional writes should be no-ops.

In the future, we should have a more comprehensive method to manage cert
rotation. For now, this will allow us to bypass monthly certificate
out-of-date issues.

Have the MCS always serve the latest kubelet config ca bundle to match
the MCD. This will ensure any nodes joining the cluster always have the
latest certificates, even if the pool is otherwise not updated or
paused.

Soft revert of openshift#2802.

This should no longer be needed since the MCD will always sync the cert
bundle to disk. If things go wrong, the MCD should degrade.

Add an e2e test for rotating certs for a paused pool

@yuqi-zhang
Contributor Author

Reorganized the commits to be more logical in separation and ordering. Also modified logic to only write if we have not acted on the particular resourceVersion
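
The "only write if we have not acted on the particular resourceVersion" logic might look roughly like the sketch below. This is an assumption-laden illustration: `certSyncer` and its fields are hypothetical, and the last-seen version is kept in memory here, whereas the commit message says the daemon records it in node annotations.

```go
package main

import "fmt"

// certSyncer is a hypothetical type that remembers the last
// controllerconfig resourceVersion it has acted on.
type certSyncer struct {
	lastSynced string
	write      func(bundle []byte) error // injected writer, e.g. an atomic file write
}

// sync writes the bundle only when resourceVersion differs from the last
// one handled. resourceVersions are treated as opaque strings and are
// compared for equality only. It reports whether a write happened.
func (s *certSyncer) sync(resourceVersion string, bundle []byte) (bool, error) {
	if resourceVersion == s.lastSynced {
		return false, nil // already handled this version
	}
	if err := s.write(bundle); err != nil {
		// lastSynced is left unchanged so the next event retries the write.
		return false, err
	}
	s.lastSynced = resourceVersion
	return true, nil
}

func main() {
	s := &certSyncer{write: func(b []byte) error {
		fmt.Printf("writing %d bytes\n", len(b))
		return nil
	}}
	// Placeholder resourceVersions and bundle contents, for illustration only.
	for _, rv := range []string{"100", "100", "101"} {
		wrote, err := s.sync(rv, []byte("bundle for "+rv))
		fmt.Printf("resourceVersion=%s wrote=%v err=%v\n", rv, wrote, err)
	}
}
```

Gating on resourceVersion skips repeated events for the same object, but it would still rewrite the file when an unrelated controllerconfig field changes, since that also bumps the resourceVersion.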

@yuqi-zhang
Contributor Author

/retest-required

@yuqi-zhang yuqi-zhang changed the title MCO-503: daemon: have a special path to sync in certs OCPBUGS-8446: MCO-503: daemon: have a special path to sync in certs Mar 7, 2023
@openshift-ci-robot openshift-ci-robot added the jira/severity-critical and jira/valid-bug labels Mar 7, 2023
@openshift-ci-robot
Contributor

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-8446, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug label Mar 7, 2023
@openshift-ci openshift-ci bot requested a review from rioliu-rh March 7, 2023 03:57
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 832de9e and 2 for PR HEAD e68a59f in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7b3939c and 1 for PR HEAD e68a59f in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 5cf66e8 and 0 for PR HEAD e68a59f in total

@openshift-ci-robot
Contributor

/hold

Revision e68a59f was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold label Mar 10, 2023
@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-8446, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

@openshift-bot
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed (BZs linked to PRs on main branch need to target different OCP), recalculating validity.

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2023

@openshift-bot: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Retaining the bugzilla/valid-bug label as it was manually added.

@sinnykumari
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold label Mar 13, 2023
@yuqi-zhang
Contributor Author

/test e2e-aws-ovn-upgrade

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 10af502 and 2 for PR HEAD e68a59f in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9cc2333 and 1 for PR HEAD e68a59f in total

@sinnykumari
Contributor

/refresh

@sinnykumari
Contributor

/retest-required

@sinnykumari
Contributor

skipping optional tests
/skip

@sinnykumari
Contributor

It looks to me like e2e-aws-ovn-upgrade is broken (https://issues.redhat.com/browse/TRT-897). Running other upgrade jobs to get a better signal
/test e2e-gcp-upgrade
/test e2e-aws-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2023

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/okd-scos-e2e-aws-ovn | e68a59f | link | false | /test okd-scos-e2e-aws-ovn
ci/prow/e2e-alibabacloud-ovn | e68a59f | link | false | /test e2e-alibabacloud-ovn
ci/prow/okd-scos-e2e-gcp-ovn-upgrade | e68a59f | link | false | /test okd-scos-e2e-gcp-ovn-upgrade

@sinnykumari
Contributor

e2e-gcp-upgrade test has passed. Overriding the e2e-aws-ovn-upgrade test, which has been red for a while (we don't know the root cause yet, possibly https://issues.redhat.com/browse/TRT-897), to get this merged.
/override e2e-aws-ovn-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2023

@sinnykumari: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • e2e-aws-ovn-upgrade

Only the following failed contexts/checkruns were expected:

  • ci/prow/e2e-alibabacloud-ovn
  • ci/prow/e2e-aws-ovn
  • ci/prow/e2e-aws-ovn-upgrade
  • ci/prow/e2e-gcp-op
  • ci/prow/e2e-gcp-ovn-rt-upgrade
  • ci/prow/e2e-gcp-upgrade
  • ci/prow/e2e-hypershift
  • ci/prow/images
  • ci/prow/okd-images
  • ci/prow/okd-scos-e2e-aws-ovn
  • ci/prow/okd-scos-e2e-gcp-ovn-upgrade
  • ci/prow/okd-scos-images
  • ci/prow/unit
  • ci/prow/verify
  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

@sinnykumari
Contributor

/override ci/prow/e2e-aws-ovn-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 15, 2023

@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/e2e-aws-ovn-upgrade

@yuqi-zhang
Contributor Author

Thanks!

/cherry-pick release-4.13

@openshift-cherrypick-robot

@yuqi-zhang: once the present PR merges, I will cherry-pick it on top of release-4.13 in a new PR and assign it to you.

@openshift-merge-robot openshift-merge-robot merged commit c1e1c5e into openshift:master Mar 15, 2023
@openshift-ci-robot
Contributor

@yuqi-zhang: Jira Issue OCPBUGS-8446: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-8446 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@yuqi-zhang: new pull request created: #3612

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR